SI 618: Data Manipulation and AnalysisΒΆ
04 - Visualization & Univariate StatisticsΒΆ
Dr. Chris Teplovs, School of Information, University of MichiganΒΆ
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Version 2023.09.20.1.CT
Overview of todayΒΆ
- Announcements
- Comments/Questions/Concerns
- Review last week: Aggregation & Grouping
- Today: Univariate Statistics & Visualization
import pandas as pd
import numpy as np
A quick note about Markdown cellsΒΆ
We have encouraged you to use Markdown cells to add text to your notebooks. Please see https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html for a more complete explanation of the use of Markdown in Jupyter. You can also examine any of the existing Markdown blocks by clicking on them.
IMPORTANT: Replace ? in the following code with your uniqname.ΒΆ
MY_UNIQNAME = 'yanlunar'
Aggregation and Grouping (review)ΒΆ
Group ByΒΆ
from io import StringIO
TESTDATA=StringIO("""State,Retailer,Fruit,Sales
MI,Walmart,Apple,100
MI,Wholefoods,Apple,150
MI,Kroger,Orange,180
CA,Walmart,Apple,220
CA,Wholefoods,Apple,180
CA,Safeway,Apple,220
CA,Safeway,Orange,110
NY,Walmart,Apple,90
NY,Walmart,Orange,80
NY,Wholefoods,Orange,120
""")
fruit = pd.read_csv(TESTDATA, index_col=None)
fruit.head()
| State | Retailer | Fruit | Sales | |
|---|---|---|---|---|
| 0 | MI | Walmart | Apple | 100 |
| 1 | MI | Wholefoods | Apple | 150 |
| 2 | MI | Kroger | Orange | 180 |
| 3 | CA | Walmart | Apple | 220 |
| 4 | CA | Wholefoods | Apple | 180 |
(a) What is the total sales for each state?ΒΆ
This requires us to group by state, and aggregate sales by taking the sum.
The easiest way of doing this if to use groupby
If you execute groupby on the dataframe what you'll get back is an object called DataFrameGroupBy
fruit.groupby('State')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x150323490>
On its own it's a bit useless... it just keeps track of which rows should go into each "pile" (where pile here means a unique group for each state)
If we ask this object to describe itself, you can see what is inside notice that it threw away all the other columns because they were not numerical. Only "Sales" which is a number, was kept
fruit.groupby('State').describe()
| Sales | ||||||||
|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | |
| State | ||||||||
| CA | 4.0 | 182.500000 | 51.881275 | 110.0 | 162.5 | 200.0 | 220.0 | 220.0 |
| MI | 3.0 | 143.333333 | 40.414519 | 100.0 | 125.0 | 150.0 | 165.0 | 180.0 |
| NY | 3.0 | 96.666667 | 20.816660 | 80.0 | 85.0 | 90.0 | 105.0 | 120.0 |
# What are the total sales for each state?
fruit.groupby('State')['Sales'].sum() # instead of size()
State CA 730 MI 430 NY 290 Name: Sales, dtype: int64
What just happend? A couple of things:
groupby()got first executed ondf, returning anDataFrameGroupByobject. This object itself is useless unless coupled with an aggregation function, such assum(),mean(),max(),apply().- Then,
sum()got executed on theDataFrameGroupByobject, generating theDataFrameobject you see above. Notice how the table looks different than the original DataFramedf? Here are the differences:- The
Statecolumn now becomes the index of the DataFrame. The string "State" is the name of the index. Notice how the index name is displayed on a lower level than column names. - Since we performed a
groupbyoperation byState, so only the unique values ofStateare kept as index. - Among the other columns, Retailer, Fruit, and Sales, only Sales is kept in the result table. This is because the aggregation function
sum()only knows how to aggregate numerical values. And only Sales is a numerical column. The other columns are hence dropped.
- The
(b) What is the total sales for each state for each fruit?ΒΆ
This requires us to perform groupby on two columns. So, we provide a list of column names to the groupby function.
Don't forget that an aggregation function needs to follow the groupby function in order to generate results.
# What is the total sales for each state for each fruit?
fruit.groupby(['State', 'Fruit'])['Sales'].sum()
State Fruit
CA Apple 620
Orange 110
MI Apple 250
Orange 180
NY Apple 90
Orange 200
Name: Sales, dtype: int64
How is this DataFrame different from the previous one?
The biggest different is that this DataFrame has what is called a MultiIndex (or hierarchical index), as opposed to a simple index. In this table, the left two "columns" are not columns but actually part of the MultiIndex, and the Sales is the single real "column" in the DataFrame. (Running out of terminologies here...)
The hierarchical index can be organized in an alternative way if we swapped the order of State and Fruit.
fruit.groupby(['Fruit', 'State'])['Sales'].sum()
Fruit State
Apple CA 620
MI 250
NY 90
Orange CA 110
MI 180
NY 200
Name: Sales, dtype: int64
(c) Which state has the maximum total sales?ΒΆ
This question is not asking about the maximum value, but rather which state holds that maximum. There are multiple ways to do it. A principled way is to use idxmax.
# Which state has the maximum total sales?
fruitSalesByState = fruit.groupby('State')['Sales'].sum()
print(fruitSalesByState)
max_state = fruitSalesByState.idxmax()
print("The state with the maximum sales is: ",max_state)
State CA 730 MI 430 NY 290 Name: Sales, dtype: int64 The state with the maximum sales is: CA
What if we wanted to get the sales value of CA again?
fruitSalesByState[max_state]
730
(d) Which state has the maximum total sales for apples?
# Which state has the maximum total sales for apples?
# give me apple sellers
apples = fruit[fruit.Fruit == 'Apple']
apples
| State | Retailer | Fruit | Sales | |
|---|---|---|---|---|
| 0 | MI | Walmart | Apple | 100 |
| 1 | MI | Wholefoods | Apple | 150 |
| 3 | CA | Walmart | Apple | 220 |
| 4 | CA | Wholefoods | Apple | 180 |
| 5 | CA | Safeway | Apple | 220 |
| 7 | NY | Walmart | Apple | 90 |
# aggr. by state
applesByState = apples.groupby('State')['Sales'].sum()
applesByState
State CA 620 MI 250 NY 90 Name: Sales, dtype: int64
Visualization and Univariate StatisticsΒΆ
Think about plotting the relationship between X and Y for the following data:
(https://en.wikipedia.org/wiki/Anscombe's_quartet)
A nice linear relationship, right?
(https://en.wikipedia.org/wiki/Anscombe's_quartet)
Anscombe's QuartetΒΆ
Matplotlib, the basis for visualization in pythonΒΆ
Matplotlib home page: https://matplotlib.org/index.html
matplotlib.pyplot is a collection of command style functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
matplotlib.pyplotΒΆ
- collection of functions that make matplotlib work like MATLAB (is that helpful???)
- each function makes some change to a figure:
- create a figure
- create a plotting area in a figure
- plots some lines in a plotting area
- decorates the plot with labels, etc.
- states are preserved across function calls
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
plt.show()
Q1: Where did the numbers on the x-axis come from?ΒΆ
Insert your answer here
To specify x- and y-values:
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
[<matplotlib.lines.Line2D at 0x150401fc0>]
import matplotlib.pyplot as plt
plt.plot([1,2,3,4], [10,4,9,10])
plt.show()
Note default shape is "b-", which means a blue line
import matplotlib.pyplot as plt
plt.plot([1,2,3,4], [10,4,9,10], 's')
plt.show()
We can explicitly set the bounds for the axes:
import matplotlib.pyplot as plt
plt.plot([1,2,3,4], [10,4,9,10], 's')
plt.axis([0, 6, 0, 20])
plt.show()
Let's generate some data to play with:
import numpy as np
import matplotlib.pyplot as plt
# evenly sampled time at 200ms intervals
t = np.arange(0., 5.2, 0.2)
t
array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4,
2.6, 2.8, 3. , 3.2, 3.4, 3.6, 3.8, 4. , 4.2, 4.4, 4.6, 4.8, 5. ])
# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()
# listing of all marker types
markers = {'.': 'point', ',': 'pixel', 'o': 'circle',
'v': 'triangle_down', '^': 'triangle_up', '<': 'triangle_left', '>': 'triangle_right',
'1': 'tri_down', '2': 'tri_up', '3': 'tri_left', '4': 'tri_right', '8': 'octagon',
's': 'square', 'p': 'pentagon', '*': 'star', 'h': 'hexagon1', 'H': 'hexagon2', '+': 'plus',
'x': 'x', 'D': 'diamond', 'd': 'thin_diamond', '|': 'vline', '_': 'hline',
'P': 'plus_filled', 'X': 'x_filled', 0:
'tickleft', 1: 'tickright', 2: 'tickup', 3: 'tickdown',
4: 'caretleft', 5: 'caretright', 6: 'caretup', 7: 'caretdown',
8: 'caretleftbase', 9: 'caretrightbase', 10: 'caretupbase', 11: 'caretdownbase',
'None': 'nothing', None: 'nothing', ' ': 'nothing', '': 'nothing'}
plt.plot(t, t, marker='o', linestyle='', color='r')
plt.plot(t, t**2, marker='s', linestyle='', color='b')
plt.plot(t, t**3, marker='^', linestyle='', color='g')
plt.show()
Q2: Try some other marker stylesΒΆ
# insert your code here
plt.plot(t, t, marker='*', linestyle='', color='r')
plt.plot(t, t**2, marker='p', linestyle='', color='b')
plt.plot(t, t**3, marker='<', linestyle='', color='g')
plt.show()
Setting line propertiesΒΆ
x, y = [1, 2, 3, 4], [1, 4, 9, 16]
x
[1, 2, 3, 4]
y
[1, 4, 9, 16]
Keyword args:
plt.plot(x, y, linewidth=2.0)
[<matplotlib.lines.Line2D at 0x15060a380>]
setter methods:
line, = plt.plot(x, y, '-')
line.set_antialiased(False)
setp()
line = plt.plot(x,y)
# use keyword args
plt.setp(line, color='r', linewidth=2.0)
# or MATLAB style string value pairs
plt.setp(line, 'color', 'r', 'linewidth', 2.0)
plt.show()
Multiple plotsΒΆ
def f(t):
return np.exp(-t) * np.cos(2*np.pi*t)
t1 = np.arange(0.0, 5.1, 0.1)
t2 = np.arange(0.0, 5.02, 0.02)
t1
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2,
1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5,
2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8,
3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5. ])
t2
array([0. , 0.02, 0.04, 0.06, 0.08, 0.1 , 0.12, 0.14, 0.16, 0.18, 0.2 ,
0.22, 0.24, 0.26, 0.28, 0.3 , 0.32, 0.34, 0.36, 0.38, 0.4 , 0.42,
0.44, 0.46, 0.48, 0.5 , 0.52, 0.54, 0.56, 0.58, 0.6 , 0.62, 0.64,
0.66, 0.68, 0.7 , 0.72, 0.74, 0.76, 0.78, 0.8 , 0.82, 0.84, 0.86,
0.88, 0.9 , 0.92, 0.94, 0.96, 0.98, 1. , 1.02, 1.04, 1.06, 1.08,
1.1 , 1.12, 1.14, 1.16, 1.18, 1.2 , 1.22, 1.24, 1.26, 1.28, 1.3 ,
1.32, 1.34, 1.36, 1.38, 1.4 , 1.42, 1.44, 1.46, 1.48, 1.5 , 1.52,
1.54, 1.56, 1.58, 1.6 , 1.62, 1.64, 1.66, 1.68, 1.7 , 1.72, 1.74,
1.76, 1.78, 1.8 , 1.82, 1.84, 1.86, 1.88, 1.9 , 1.92, 1.94, 1.96,
1.98, 2. , 2.02, 2.04, 2.06, 2.08, 2.1 , 2.12, 2.14, 2.16, 2.18,
2.2 , 2.22, 2.24, 2.26, 2.28, 2.3 , 2.32, 2.34, 2.36, 2.38, 2.4 ,
2.42, 2.44, 2.46, 2.48, 2.5 , 2.52, 2.54, 2.56, 2.58, 2.6 , 2.62,
2.64, 2.66, 2.68, 2.7 , 2.72, 2.74, 2.76, 2.78, 2.8 , 2.82, 2.84,
2.86, 2.88, 2.9 , 2.92, 2.94, 2.96, 2.98, 3. , 3.02, 3.04, 3.06,
3.08, 3.1 , 3.12, 3.14, 3.16, 3.18, 3.2 , 3.22, 3.24, 3.26, 3.28,
3.3 , 3.32, 3.34, 3.36, 3.38, 3.4 , 3.42, 3.44, 3.46, 3.48, 3.5 ,
3.52, 3.54, 3.56, 3.58, 3.6 , 3.62, 3.64, 3.66, 3.68, 3.7 , 3.72,
3.74, 3.76, 3.78, 3.8 , 3.82, 3.84, 3.86, 3.88, 3.9 , 3.92, 3.94,
3.96, 3.98, 4. , 4.02, 4.04, 4.06, 4.08, 4.1 , 4.12, 4.14, 4.16,
4.18, 4.2 , 4.22, 4.24, 4.26, 4.28, 4.3 , 4.32, 4.34, 4.36, 4.38,
4.4 , 4.42, 4.44, 4.46, 4.48, 4.5 , 4.52, 4.54, 4.56, 4.58, 4.6 ,
4.62, 4.64, 4.66, 4.68, 4.7 , 4.72, 4.74, 4.76, 4.78, 4.8 , 4.82,
4.84, 4.86, 4.88, 4.9 , 4.92, 4.94, 4.96, 4.98, 5. ])
plt.figure(1)
plt.subplot(211)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')
plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
plt.show()
# Simple data to display in various forms
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)
Just a figure and one subplot:
f, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Simple plot')
plt.show()
Two subplots, the axes array is 1-d
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(x, y)
axarr[0].set_title('Sharing X axis')
axarr[1].scatter(x, y)
plt.show()
Two subplots, unpack the axes array immediately
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing Y axis')
ax2.scatter(x, y)
<matplotlib.collections.PathCollection at 0x150a159f0>
Three subplots sharing both x/y axes
f, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing both axes')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
plt.show()
Fine-tune figure; make subplots close to each other and hide x ticks for all but bottom plot.
f, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing both axes')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
f.subplots_adjust(hspace=0)
plt.setp([a.get_xticklabels() for a in f.axes[:-1]], visible=False)
plt.show()
Row and column sharing:
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row')
ax1.plot(x, y)
ax1.set_title('Sharing x per column, y per row')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
ax4.plot(x, 2 * y ** 2 - 1, color='r')
plt.show()
Adding TextΒΆ
# Fixing random state for reproducibility
np.random.seed(618)
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)
# the histogram of the data
n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(55, .025, 'mean=100, SD=15')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()
See also text properties and layout.
Annotating TextΒΆ
ax = plt.subplot(111)
t = np.arange(0.0, 5.0, 0.01)
s = np.cos(2*np.pi*t)
line, = plt.plot(t, s, lw=2)
plt.annotate('local max',
xy=(2, 1),
xytext=(3, 1.5),
arrowprops=dict(facecolor='black', shrink=0.05),
)
plt.ylim(-2,2)
plt.show()
Q3: Using the McDonald's menu dataset, plot any 2 continuous variables as a scatterplot and annotate an interesting feature (e.g. local max, outlier, etc.).ΒΆ
menu = pd.read_csv('https://github.com/umsi-data-science/data/raw/main/menu.csv')
menu.head()
| Category | Item | Serving Size | Calories | Calories from Fat | Total Fat | Total Fat (% Daily Value) | Saturated Fat | Saturated Fat (% Daily Value) | Trans Fat | ... | Carbohydrates | Carbohydrates (% Daily Value) | Dietary Fiber | Dietary Fiber (% Daily Value) | Sugars | Protein | Vitamin A (% Daily Value) | Vitamin C (% Daily Value) | Calcium (% Daily Value) | Iron (% Daily Value) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breakfast | Egg McMuffin | 4.8 oz (136 g) | 300 | 120 | 13.0 | 20 | 5.0 | 25 | 0.0 | ... | 31 | 10 | 4 | 17 | 3 | 17 | 10 | 0 | 25 | 15 |
| 1 | Breakfast | Egg White Delight | 4.8 oz (135 g) | 250 | 70 | 8.0 | 12 | 3.0 | 15 | 0.0 | ... | 30 | 10 | 4 | 17 | 3 | 18 | 6 | 0 | 25 | 8 |
| 2 | Breakfast | Sausage McMuffin | 3.9 oz (111 g) | 370 | 200 | 23.0 | 35 | 8.0 | 42 | 0.0 | ... | 29 | 10 | 4 | 17 | 2 | 14 | 8 | 0 | 25 | 10 |
| 3 | Breakfast | Sausage McMuffin with Egg | 5.7 oz (161 g) | 450 | 250 | 28.0 | 43 | 10.0 | 52 | 0.0 | ... | 30 | 10 | 4 | 17 | 2 | 21 | 15 | 0 | 30 | 15 |
| 4 | Breakfast | Sausage McMuffin with Egg Whites | 5.7 oz (161 g) | 400 | 210 | 23.0 | 35 | 8.0 | 42 | 0.0 | ... | 30 | 10 | 4 | 17 | 2 | 21 | 6 | 0 | 25 | 10 |
5 rows Γ 24 columns
menu.columns
Index(['Category', 'Item', 'Serving Size', 'Calories', 'Calories from Fat',
'Total Fat', 'Total Fat (% Daily Value)', 'Saturated Fat',
'Saturated Fat (% Daily Value)', 'Trans Fat', 'Cholesterol',
'Cholesterol (% Daily Value)', 'Sodium', 'Sodium (% Daily Value)',
'Carbohydrates', 'Carbohydrates (% Daily Value)', 'Dietary Fiber',
'Dietary Fiber (% Daily Value)', 'Sugars', 'Protein',
'Vitamin A (% Daily Value)', 'Vitamin C (% Daily Value)',
'Calcium (% Daily Value)', 'Iron (% Daily Value)'],
dtype='object')
# insert your code here
plt.scatter(menu['Sodium'], menu['Calories'])
plt.xlabel('Sodium')
plt.ylabel('Calories')
plt.annotate("Don't eat this!", xy=(3500, 1800), xytext=(3000, 1400), arrowprops=dict(facecolor='black', shrink=0.05))
Text(3000, 1400, "Don't eat this!")
Using specific types of plots via pyplotΒΆ
In addition to scatterplots, pyplot offers a number of other plot types. These can be accessed via convenience functions such as scatter(), hist(), bar(), barh(), and pie(), amongst others:
plt.scatter(menu["Calories"], menu["Saturated Fat"])
plt.show()
plt.hist(menu['Calories'])
plt.show()
Q4: Create a histogram of any one of the continuous variables from the McDonalds menu dataset.
# insert your code here
plt.hist(menu['Calories'], bins=20)
# plt.hist(menu['Calories'])
(array([21., 34., 53., 39., 39., 33., 12., 15., 5., 3., 2., 2., 1.,
0., 0., 0., 0., 0., 0., 1.]),
array([ 0., 94., 188., 282., 376., 470., 564., 658., 752.,
846., 940., 1034., 1128., 1222., 1316., 1410., 1504., 1598.,
1692., 1786., 1880.]),
<BarContainer object of 20 artists>)
Pandas and matplotlib integrationΒΆ
Cumbersome? Yes. A better way? Use the matplotlib integration from pandas:
f = menu['Calories'].plot(kind='hist')
type(f)
matplotlib.axes._axes.Axes
Here are the valid values for "kind":
kind : - 'line' : line plot (default) - 'bar' : vertical bar plot - 'barh' : horizontal bar plot - 'hist' : histogram - 'box' : boxplot - 'kde' : Kernel Density Estimation plot - 'density' : same as 'kde' - 'area' : area plot - 'pie' : pie plot
Bar plots with groupby()ΒΆ
categories = menu.groupby('Category').size()
categories.plot(kind='barh')
<Axes: ylabel='Category'>
categories_sorted = categories.sort_values(ascending=True)
categories_sorted.plot(kind='barh')
<Axes: ylabel='Category'>
Q5: Create a new column in the menu DataFrame called "Sugary" whose value is 1 if the values of "Sugars" is greater than 20, otherwise set it to 0.
Hint: use np.where(...) (look it up in the documentation)
# insert your code here
menu['Sugary']=np.where(menu['Sugars'] > 10, 'High', 'Low')
Create a stacked bar plot by using a 2-level groupby() followed by an unstack():ΒΆ
menu.groupby(["Category", "Sugary"]).size().unstack()
| Sugary | High | Low |
|---|---|---|
| Category | ||
| Beef & Pork | 3.0 | 12.0 |
| Beverages | 18.0 | 9.0 |
| Breakfast | 14.0 | 28.0 |
| Chicken & Fish | 7.0 | 20.0 |
| Coffee & Tea | 85.0 | 10.0 |
| Desserts | 6.0 | 1.0 |
| Salads | 1.0 | 5.0 |
| Smoothies & Shakes | 28.0 | NaN |
| Snacks & Sides | 1.0 | 12.0 |
menu.groupby(["Category", "Sugary"]).size().unstack().plot(kind="bar")
<Axes: xlabel='Category'>
menu.groupby(["Category", "Sugary"]).size().unstack().plot(kind="bar", stacked=True)
<Axes: xlabel='Category'>
menu.groupby(['Category', 'Sugary']).size().groupby(by='Category', group_keys=False).apply(
lambda x: 100 * x / x.sum()
).unstack().plot(kind='bar', stacked=True)
<Axes: xlabel='Category'>
Q6: Repeat the above steps to generate three bar plots for any other continuous variable that you split into "high" and "low" values, just as with did with "Sugars" above.
# insert your code here
menu.columns
Index(['Category', 'Item', 'Serving Size', 'Calories', 'Calories from Fat',
'Total Fat', 'Total Fat (% Daily Value)', 'Saturated Fat',
'Saturated Fat (% Daily Value)', 'Trans Fat', 'Cholesterol',
'Cholesterol (% Daily Value)', 'Sodium', 'Sodium (% Daily Value)',
'Carbohydrates', 'Carbohydrates (% Daily Value)', 'Dietary Fiber',
'Dietary Fiber (% Daily Value)', 'Sugars', 'Protein',
'Vitamin A (% Daily Value)', 'Vitamin C (% Daily Value)',
'Calcium (% Daily Value)', 'Iron (% Daily Value)', 'Sugary'],
dtype='object')
menu['Protein'].describe()
count 260.000000 mean 13.338462 std 11.426146 min 0.000000 25% 4.000000 50% 12.000000 75% 19.000000 max 87.000000 Name: Protein, dtype: float64
menu['Protein Content']=np.where(menu['Protein'] >= 15, 'High', 'Low')
menu.groupby(['Category', 'Protein Content']).size().unstack().plot(kind='bar', stacked=True)
<Axes: xlabel='Category'>
Pie ChartsΒΆ
There are many issues with pie charts, and the one below is a good example of what not to do, but everyone wants to know how to make them:
# don't do this...
categories.plot(kind='pie', title='Undecipherable Menu Categories')
<Axes: title={'center': 'Undecipherable Menu Categories'}>
Q7: What type of plot can always replace a pie chart?
categories.plot(kind='bar')
<Axes: xlabel='Category'>
bar chart
Subplots (again)ΒΆ
In addition to the way we used subplots in the previous class, we can use the .subplots() function to generate mulitple plots within a figure. subplots() returns a set of axes on which we can make plots.
To demonstrate how this works, let's fill in just one of the subplots:
f, (ax1, ax2) = plt.subplots(2) # if only 1 argument, we assume it's the number of rows
ax1.hist(menu['Calories'])
plt.show()
Now let's fill in both subplots:
f, (ax1, ax2) = plt.subplots(2)
ax1.hist(menu['Calories'])
ax2.plot(menu['Calories'], menu['Total Fat'], 'bo')
plt.show()
Now let's make a 2x2 layout of 4 plots. Note the structure of the return values from the subplots function:
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
ax1.hist(menu['Calories'])
ax2.plot(menu['Calories'], menu['Total Fat'], 'bo')
plt.show()
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
ax1.hist(menu['Total Fat'])
ax2.plot(menu['Calories'], menu['Total Fat'], 'bo')
ax3.hist(menu['Dietary Fiber'])
ax4.plot(menu['Calories'], menu['Total Fat'], 'bo')
plt.show()
Alternatively, we can use the pandas-matplotlib integration. Note the use of the ax= keyword arg.
f, (ax1, ax2) = plt.subplots(2)
menu['Calories'].plot(ax=ax1, kind='hist')
menu['Dietary Fiber'].plot(ax=ax2, kind='hist')
plt.show()
Q8: Use subplots() to create a figure consisting of 4 plots.
They could be scatter plots, histograms, bar charts, pie plots, or any of the kinds (repeated here for your convenience):
kind :
- 'line' : line plot (default)
- 'bar' : vertical bar plot
- 'barh' : horizontal bar plot
- 'hist' : histogram
- 'box' : boxplot
- 'kde' : Kernel Density Estimation plot
- 'density' : same as 'kde'
- 'area' : area plot
- 'pie' : pie plot
# insert your code here
BREAKΒΆ
Simple (univariate) statisticsΒΆ
# The weights of a herd of 80 cows
# Fixing random state for reproducibility
np.random.seed(618)
measures = (np.random.standard_normal(80)*150+1000).astype(int)
measures
array([1167, 1075, 893, 1066, 1025, 1084, 665, 1021, 1085, 886, 1161,
959, 1008, 1004, 953, 1050, 1091, 857, 881, 891, 1048, 1213,
1094, 1031, 935, 1014, 1054, 933, 1053, 1026, 844, 996, 1115,
836, 884, 901, 801, 1226, 1091, 1448, 1119, 1215, 1199, 1030,
1163, 987, 925, 1127, 1108, 1146, 1400, 984, 1035, 841, 1015,
922, 1042, 1026, 888, 987, 933, 850, 1237, 1018, 955, 1308,
817, 1146, 1022, 884, 1200, 979, 787, 1056, 1105, 1101, 1054,
1285, 770, 710])
# equivalently
# Fixing random state for reproducibility
np.random.seed(618)
measures = (np.random.normal(1000,150,80)).astype(int)
measures
array([1167, 1075, 893, 1066, 1025, 1084, 665, 1021, 1085, 886, 1161,
959, 1008, 1004, 953, 1050, 1091, 857, 881, 891, 1048, 1213,
1094, 1031, 935, 1014, 1054, 933, 1053, 1026, 844, 996, 1115,
836, 884, 901, 801, 1226, 1091, 1448, 1119, 1215, 1199, 1030,
1163, 987, 925, 1127, 1108, 1146, 1400, 984, 1035, 841, 1015,
922, 1042, 1026, 888, 987, 933, 850, 1237, 1018, 955, 1308,
817, 1146, 1022, 884, 1200, 979, 787, 1056, 1105, 1101, 1054,
1285, 770, 710])
Exercise 9:ΒΆ
Calculate some summary statistics for the measures variable you just created.
# insert your code here
pd.Series(measures).describe()
count 80.000000 mean 1021.762500 std 145.408481 min 665.000000 25% 924.250000 50% 1025.500000 75% 1102.000000 max 1448.000000 dtype: float64
Visualizing data with SeabornΒΆ
- Visualization package built on top of matplotlib
- It's meant to make your life better
- Plays well with pandas, numpy, scipy, and statsmodels
- Many different visualization are included:
- Strip plots, Swarm plots, Violin plots
- Box plots
- Histograms
We need to import the package, and it's typically imported as sns:
import seaborn as sns
Seaborn versus MatplotlibΒΆ
- Matplotlib
- Low-level, basis for many packages
- Painful to construct certain graphs
- Not Pandas friendly
- Not interactive
- Seaborn
- Pandas friendlier
- Great for some stats plots
import seaborn as sns
Strip PlotΒΆ
sns.stripplot(x=measures)
<Axes: >
Swarm PlotΒΆ
sns.swarmplot(x=measures)
<Axes: >
Violin PlotΒΆ
- If we have too much data
sns.violinplot(x=measures)
<Axes: >
Box PlotΒΆ
sns.boxplot(x=measures)
<Axes: >
And we can manipulate the underlying plot to control different features. See
https://stackoverflow.com/questions/34162443/why-do-many-examples-use-fig-ax-plt-subplots-in-matplotlib-pyplot-python#34162641
and
https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplots for explanations about plt.subplots()
fig, ax = plt.subplots()
fig.set_size_inches(11, 8)
medianprops = dict(linewidth=2.5, color='firebrick')
sns.boxplot(medianprops=medianprops, data=measures, orient='h')
_ = plt.xticks(rotation=90)
HistogramΒΆ
We're going to use this a lot. Seaborn puts a nice smooth line over a distribution. We'll talk about that soon, but for now just think about it as an extrapolation: if we had a bunch more data, the distribution would eventually smooth out to something that looks like the line.
# x axis = value, y axis = count (frequency)
sns.histplot(measures, kde=False)
<Axes: ylabel='Count'>
# note use of displot rather than histplot
sns.displot(measures, rug=True); # show a strip plot on bottom -- we call it a "rug"
/Users/luyan/Documents/UM/23Fall/si618/.venv/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
Visualizing the menu data using SeabornΒΆ
sns.histplot(menu.Calories)
<Axes: xlabel='Calories', ylabel='Count'>
sns.histplot(menu['Total Fat'])
<Axes: xlabel='Total Fat', ylabel='Count'>
# Relationship between petal_width & petal_length by species
sns.relplot(x="Calories", y="Total Fat",# hue="Category",
sizes=(40, 400), alpha=.5,
height=6, data=menu)
/Users/luyan/Documents/UM/23Fall/si618/.venv/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x150b76170>
# Relationship between petal_width & petal_length by species
sns.relplot(x="Calories", y="Total Fat",hue="Category",
sizes=(40, 400), alpha=.5,
height=6, data=menu)
/Users/luyan/Documents/UM/23Fall/si618/.venv/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x150bb9c90>
sns.jointplot(x='Calories', y='Total Fat', data=menu)
<seaborn.axisgrid.JointGrid at 0x1507bc160>
It's a bit difficult to see where the interesting areas in the plot are, so it's worth trying a hexbin plot. Go ahead and copy the above code block and add kind="hex" to the jointplot parameters.
# insert your code here
Finally, you may want to look at all the numeric variables in your
dataset. Use pairplot to do this:
sns.pairplot(menu.query("Category == 'Breakfast'"),
vars=['Calories', 'Total Fat', 'Protein', 'Dietary Fiber'])
/Users/luyan/Documents/UM/23Fall/si618/.venv/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.PairGrid at 0x132fcd660>
We can get fancier by using a different column to set the color (or "hue"):
Try running the following code:
sns.pairplot(data=menu,
hue="Category",
vars=['Calories', 'Total Fat', 'Protein', 'Dietary Fiber'])
/Users/luyan/Documents/UM/23Fall/si618/.venv/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.PairGrid at 0x14fc4a500>
UniformΒΆ
uniform = np.random.uniform(-2, 2, 1000) # low,high,count
sns.displot(uniform, kde=False)
/Users/luyan/Documents/UM/23Fall/si618/.venv/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x14fda4be0>
BimodalΒΆ
bimodal = np.append(np.random.normal(-20, 10, 100),
np.random.normal(20, 10, 100))
sns.displot(bimodal)
/Users/luyan/Documents/UM/23Fall/si618/.venv/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x14fc220e0>
PoissonΒΆ
$$ P(k~events~in~interval) = \frac{\lambda^ke^{-\lambda}}{k!} $$
$\lambda$ is the event rate
Examples
- Meteor strikes
- Arrival of patients to hospital
# as lambda goes up --> looks more normal
pois = np.random.poisson(3, 100000) # lambda, count
sns.displot(pois)
/Users/luyan/Documents/UM/23Fall/si618/.venv/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x1339a0dc0>
Power/Zipf/ParetoΒΆ
$$ P = \frac{x^{-a}}{\zeta(a)}$$
"long tail"
- degree distribution
- movie/music popularity
- words
power = np.random.zipf(2, 100000)
sns.displot(power, log_scale=(True, True))
/Users/luyan/Documents/UM/23Fall/si618/.venv/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x14fdc6440>
Visual Tests on DataΒΆ
testdata = (np.random.standard_normal(500)*20+150).astype(int)
sns.histplot(testdata)
<Axes: ylabel='Count'>
Run SequenceΒΆ
- Run Sequence (index versus value)
- flat and non-drifting
- fixed-location assumption holds
- vertical spread same over the entire plot,
- then the fixed-variation assumption holds.
ax = sns.regplot(x=np.arange(len(testdata)), y=testdata, fit_reg=False)
ax.set_ylim(0, 250)
ax.set_ylabel("val")
ax.set_xlabel("i")
Text(0.5, 0, 'i')
drifting = np.array([testdata[i]+i*(.1) for i in np.arange(len(testdata))])
sns.histplot(data=drifting, kde=False)
<Axes: ylabel='Count'>
ax = sns.regplot(x=np.arange(len(drifting)), y=drifting, fit_reg=False)
ax.set_ylim(0, 300)
ax.set_ylabel("val")
ax.set_xlabel("i")
Text(0.5, 0, 'i')
expanding = np.array([(testdata[i]+i*np.random.randint(-1,1)*.2)
for i in np.arange(len(testdata))])
sns.histplot(expanding)
<Axes: ylabel='Count'>
ax = sns.regplot(x=np.arange(len(expanding)), y=expanding)
ax.set_ylim(0, 300)
ax.set_ylabel("val")
ax.set_xlabel("i")
Text(0.5, 0, 'i')
Lag PlotΒΆ
- Plot point $y_i$ versus $y_{i-1}$
- If the lag plot is structureless
- randomness assumption holds.
lag = testdata.copy()
lag = np.array(lag[:-1])
current = testdata[1:]
ax = sns.regplot(x=current, y=lag, fit_reg=False)
ax.set_ylabel("y_i-1")
ax.set_xlabel("y_i")
Text(0.5, 0, 'y_i')
connected = np.array([testdata[i]+testdata[i-1] for i in np.arange(500)])
sns.histplot(connected)
<Axes: ylabel='Count'>
lag = connected.copy()
lag = np.array(lag[:-1])
current = connected[1:]
ax = sns.regplot(x=current, y=lag, fit_reg=False)
ax.set_ylabel("y_i-1")
ax.set_xlabel("y_i")
Text(0.5, 0, 'y_i')
QQ PlotΒΆ
- QQ Plots takes our n ordered data points
- sorted from smallest to largest
- Asks:
- What is the relationship between quantiles from our data and quantiles from a theoretical distribution that we're assuming the sample is drawn from
from scipy import stats
qntls, xr = stats.probplot(testdata, fit=False)
sns.regplot(y=xr, x=qntls)
<Axes: >
def random_snorm(n, mean = 0, sd = 1, xi = 1.5):
def random_snorm_aux(n, xi):
weight = xi/(xi + 1/xi)
z = np.random.uniform(-weight,1-weight,n)
xi_ = xi**np.sign(z)
random = -np.absolute(np.random.normal(0,1,n))/xi_ * np.sign(z)
m1 = 2/np.sqrt(2 * np.pi)
mu = m1 * (xi - 1/xi)
sigma = np.sqrt((1 - m1**2) * (xi**2 + 1/xi**2) + 2 * m1**2 - 1)
return (random - mu)/sigma
return random_snorm_aux(n, xi) * sd + mean
rightskewed = random_snorm(1000, xi=2)*100
sns.histplot(rightskewed, kde=False)
<Axes: ylabel='Count'>
qntls, xr = stats.probplot(rightskewed, fit=False)
sns.regplot(x=xr, y=qntls)
<Axes: >
leftskewed = random_snorm(1000, xi=-2)*100
sns.histplot(leftskewed, kde=False)
<Axes: ylabel='Count'>
qntls, xr = stats.probplot(leftskewed, fit=False)
sns.regplot(x=xr, y=qntls)
<Axes: >
Now the serious plots... let's wrap them in a single function that we can callΒΆ
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
def multiplePlots(series):
fig, axs = plt.subplots(2, 2)
plt.tight_layout(pad=0.4, w_pad=4, h_pad=1.0)
# Histogram
sns.histplot(series, ax=axs[0, 0])
# Lag plot
lag = series.copy()
lag = np.array(lag[:-1])
current = series[1:]
ax = sns.regplot(x=current, y=lag, fit_reg=False, ax=axs[0, 1])
ax.set_ylabel("y_i-1")
ax.set_xlabel("y_i")
# QQ plot (updated 2023.08.18.1.CT to match usual QQ plot -- axes flipped)
qntls, xr = stats.probplot(series, fit=False)
sns.regplot(y=xr, x=qntls, ax=axs[1, 0])
# Run sequence
ax = sns.regplot(x=np.arange(len(series)), y=series, ax=axs[1, 1])
ax.set_ylabel("val")
ax.set_xlabel("i")
Now run this on measuresΒΆ
multiplePlots(measures)
Q12: Do your data look normally distributed?ΒΆ
Yes. In the bar chart, the points distributed normally around the mean and the deviation is quite symmetry. Also, in Rag, QQ and Run plot, the points are also well distributed
Q13: ΒΆ
The sample.csv file, loaded for you in the next cell, contains 9 variables (v0 through v8) that contain measures drawn from different distributions. Your task is to use the investigative techniques we discussed in today's lab to determine what type of distribution the sample is drawn from.
You should first load the CSV file into a DataFrame, then look at various aspects of each variable.
Your responses should consist of code cells, as well as markdown cells that state something like:
Variable v99 appears to be drawn from a uniform distribution with mean X and standard deviation Y.
A histogram of the data appears to be... The QQ plot shows....
Note: try not to repeat your code. You can use loops, functions, etc. to avoid repeating yourself. (remember, DRY).
distributions = pd.read_csv('https://raw.githubusercontent.com/umsi-data-science/data/main/sample.csv')
for i in range(9):
multiplePlots(distributions[distributions.columns[i]])
multiplePlots(distributions['v0'])
normal distribution
multiplePlots(distributions['v1'])
right-skewed
multiplePlots(distributions['v2'])
left-skewed
(